Skip to content

Add support for tracking parent of CodebaseResource entries and ensure top level paths are stored #1691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

aayushkdev
Copy link
Collaborator

@aayushkdev aayushkdev commented Jun 13, 2025

fix #1687

Summary/Goals of this change
This PR introduces an parent_directory_path field to the CodebaseResource model to optimize fetching children of a directory. The existing .children() method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.
Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree

Breaking Changes
The file tree will rely on the newly introduced ancestor field and the presence of top-level paths. As a result, existing projects scanned before this PR will need to be re-scanned to work with the file tree view.

Open Questions
Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?

  • parent_path()
  • has_parent()
  • parent()

Would love feedback on this!

Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces an ancestor field to the CodebaseResource model to optimize fetching children of a directory. The existing .children() method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.
Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree

Could you explain and document, with data and examples, how storing the parent_path on the resource, would be used in the rendering context?
No need to add code, just as discussion in the PR as a start would be useful.


Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?

Let's wait a bit to ensure we have a working system before deprecating existing code ;)


See my comment about the code.
It is missing unit test to ensure the new field is properly set (with regular path, and with root).
Also a unit test to cover the changes made to rootfs.py will be appreciated.

@aayushkdev
Copy link
Collaborator Author

Use of parent_directory_path in the Rendering Context

  • Fetching children of a directory is now a simple filter on the parent_directory_path field, which is indexed and fast. where previously the .children() was used method that used regex based filtering which was slow and inefficient

  • Adding the parent_directory_path field also makes determining whether a directory has children easier as this can now be handled via a single annotated query that counts children per resource. This eliminates the need for per-directory .exists() checks, which gets very expensive with hundreds or thousands of folders.

  • Top-level resources (those with parent_directory_path=None) can be queried directly and efficiently, making the initial rendering of the tree straightforward and performant.

Step by step flow

README.md
src/
  └── main.py
  └── utils/
       └── helpers.py
data/
  └── input.txt
bin/

1. Initial Load:

  • The codebase fetches the top-level directories/files using:
CodebaseResource.objects.filter(project=project, parent_directory_path=None)
  • This fetches: README.md, src/, data/, and bin/, as they have their parent_directory_path field set to None.

2. Determining Expandability

  • Another query will be fired using parent_directory_path=None for fetching and annotating an extra field has_children.
  • This will indicate whether a directory can be expanded in the UI.
  • In this case:
  • bin/ and README.md have no children => has_children = False
  • src/ and data/ have children => has_children = True

3. Expanding a Folder (e.g., src/):

  • Clicking on the src/ folder sends a request with:
CodebaseResource.objects.filter(project=project, parent_directory_path="src")
  • This fetches its children: main.py and utils/, as their parent is src/ .

4. Determining Children of Fetched Items:

  • Like before another query is fired to check if these items have children.
  • In this case:
  • main.py => has_children = False
  • utils/ => has_children = True

@aayushkdev aayushkdev requested a review from tdruez June 14, 2025 19:05
tdruez added a commit that referenced this pull request Jun 16, 2025
In preparation of adding parent_path as a field #1691

Signed-off-by: tdruez <tdruez@nexb.com>
tdruez added a commit that referenced this pull request Jun 16, 2025
In preparation of adding parent_path as a field #1691

Signed-off-by: tdruez <tdruez@nexb.com>
@tdruez
Copy link
Contributor

tdruez commented Jun 16, 2025

@aayushkdev Thanks for all the implementation details!

I've renamed the old parent_path function to parent_directory in #1694 so it's not in the way of the new field.
Could you merge those into your branch and define parent_directory_path as parent_path before the merge?

@aayushkdev
Copy link
Collaborator Author

Hey @tdruez I have merged your changes and renamed the required fields.

Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aayushkdev Looking good, see my suggestion for improvements.

@tdruez
Copy link
Contributor

tdruez commented Jun 17, 2025

Breaking Changes
The file tree will rely on the newly introduced ancestor field and the presence of top-level paths. As a result, existing projects scanned before this PR will need to be re-scanned to work with the file tree view.

A data migration could solve this. The question is how fast could we migrate the data on existing databases that include millions of CodebaseResource records.

@aayushkdev
Copy link
Collaborator Author

Yes, iterating through all the resources in CodebaseResource to add the parent_path value to each resource would be quite slow, especially with multiple projects. If you'd like, I can prepare a migration script to test this out and see how it affects performance.

@tdruez
Copy link
Contributor

tdruez commented Jun 17, 2025

I don't think a "script" is the right approach here. We need to look into a raw SQL approach for acceptable performances.

@aayushkdev
Copy link
Collaborator Author

Yeah, I think this will be doable. I’ll explore this further and test it out.

@aayushkdev
Copy link
Collaborator Author

aayushkdev commented Jun 20, 2025

Hey @tdruez, I’ve got the migration for setting parent_path working. However, I could use your input on the migration for inserting the missing root directories. I initially assumed I could just insert the path, project_id, and parent_path fields only which would be doable through sql, but it turns out there are other non-nullable fields like sha1 and md5 that typically require a scan to populate which isnt possible with pure sql. One workaround could be to use dummy values for these fields. What do you think?

@tdruez
Copy link
Contributor

tdruez commented Jun 20, 2025

I initially assumed I could just insert the path, project_id, and parent_path fields only which would be doable through sql

Why would you have to insert those field? We are talking about generating a value for parent_path on existing CodebaseResource. Those existing records already have a path and a project_id.

I’ve got the migration for setting parent_path working.

Can you paste your progress on this migration in a comment, it'll be easier to provide you feedback.

@aayushkdev
Copy link
Collaborator Author

aayushkdev commented Jun 20, 2025

Why would you have to insert those field? We are talking about generating a value for parent_path on existing CodebaseResource. Those existing records already have a path and a project_id.

This migration has two parts first, adding the parent_path field to each resource, which I’ve already completed and second, inserting new resources with top-level root paths into CodebaseResource for each project. This is necessary because pipelines like map_deploy_to_develop don’t store root-level directories like from and to in the CodebaseResource, so we need to add that with the migration.

Can you paste your progress on this migration in a comment, it'll be easier to provide you feedback.

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('scanpipe', '0073_codebaseresource_parent_path'),
    ]

    operations = [
        migrations.RunSQL(
            """
            UPDATE scanpipe_codebaseresource
            SET parent_path = regexp_replace(path, '/[^/]+$', '')
            WHERE path LIKE '%/%';
            """
        ),
    ]

@aayushkdev
Copy link
Collaborator Author

Hey @tdruez @pombredanne , I explored different ways to store trees (like MPTT, Treebeard and ltree), but they all involve more complex integration to scancode.io, introduce even more additional fields, or require us to change the way Resources are created which slows down creation of resources, all while offering a minimal performance benefit over the parent_path field approach. Introducing the parent_path field provides a more lightweight and efficient solution that integrates cleanly with our existing path logic.

I also tested the parent_path field approach on deeply nested trees, and performance remains strong even with thousands of resources.

So I don't think its necessary to change our approach on how we handle the storage of trees.

VarshaUN pushed a commit to VarshaUN/scancode.io that referenced this pull request Jul 1, 2025
aboutcode-org#1694)

In preparation of adding parent_path as a field aboutcode-org#1691

Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…ource

Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…h` is always set

Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…one to align with the code format

Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
@aayushkdev aayushkdev force-pushed the 1687-codebaseresource-parent-and-top-paths branch from 7928f3c to 2b5b2f7 Compare July 12, 2025 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Add support for tracking parent of CodebaseResource entries and ensure top level paths are stored
2 participants